Measuring welfare in rearing piglets: test–retest reliability of selected animal-based indicators

Abstract The “Welfare Quality protocols” (WQP) were developed in 2009 as objective welfare assessment tools. The WQP are based on four welfare principles: 1) “good feeding”, 2) “good housing”, 3) “good health”, and 4) “appropriate behavior”. The included WQP-indicators were developed for growing pigs and are recommended for rearing piglets, although, to the authors’ knowledge, they have not been tested in this age class. Therefore, the present study tested selected indicators from different welfare assessment protocols with regard to test–retest reliability (TRR), i.e., consistency over time, in an on-farm study on rearing piglets. This makes it possible to investigate whether the WQP-indicators developed for growing pigs can be recommended for rearing piglets and whether the additional indicators should be included in the WQP. In total, 28 selected pen- or individual-level indicators were used by one observer to assess the animal welfare of rearing piglets on three pig farms. Per batch, 40 to 125 piglets were randomly selected and individually marked to record the weekly assessments. This procedure was repeated in three consecutive batches per farm and resulted in a total of 759 rearing piglets being assessed. Spearman’s rank correlation co-efficient (RS), intraclass correlation co-efficient (ICC), and limits of agreement (LoA) were calculated to evaluate their TRR, especially whether the TRR was influenced by the group of assessed animals (batch comparisons) or the age of the assessed piglets (age class comparisons). Of the 28 indicators, 12 had a very low prevalence of <1%, making any statement about their TRR meaningless. Of the pen-level indicators, “sneezing” achieved acceptable TRR for both comparisons and “behavioral observations” (BO) achieved in general good values (e.g., “positive social behavior”: RS: 0.34 to 0.89; ICC: 0.00 to 0.90; LoA ϵ [−2.93; 7.41] to ϵ [−18.9; 11.5]) for both comparisons (batch, age class).
The WQP-indicators of sufficient TRR, such as “tail lesions”, “lameness”, “wounds on the body”, “human–animal-relationship test” and “BO”, cannot cover the four welfare principles adequately. In particular, problems remained with the welfare principles of “good feeding”, “good housing”, and partly “good health”. However, these grievances could be overcome by including further indicators from other sources outside the WQP which have acceptable to good results for TRR in this study, such as “back posture”, “ear lesions”, “normal behavior”, and “tail posture”.


Introduction
In the last two decades, the welfare of farm animals has increasingly become the focus of public attention and customers have demanded transparency regarding the housing conditions of food-producing livestock (Broom, 2010; European Commission, 2016). The growing public discussion led to the Welfare Quality research project, during which the "Welfare Quality Animal Welfare Assessment Protocols" (WQP) were developed. These are based on the following four welfare principles: 1) "good feeding", 2) "good housing", 3) "good health", and 4) "appropriate behavior". Each of these principles is defined by different criteria characterized by one or more different indicators (Botreau et al., 2007; Blokhuis, 2008). The indicators of the WQP are mainly animal-based, as these can best reflect the actual state of the animal, as opposed to resource- or management-based indicators, which rather assess the risk of the husbandry environment. However, it is well known that animal-based indicators pose a challenge with regard to feasibility, validity, and reliability (Velarde and Geers, 2007; Blokhuis et al., 2010). Most of the indicators had been tested for their feasibility, validity, and reliability in preliminary studies on growing pigs (Forkman and Keeling, 2009) before they were included in the WQP. However, to the authors' knowledge, no studies have yet been conducted on the test-retest reliability (TRR) of the indicators in the age group of rearing piglets.
In Germany, the ongoing debate about welfare and transparency led to a change in the national law for animal protection (German designation: Tierschutzgesetz) in 2014. Since then, the owners of livestock have been legally obligated to undertake farm self-monitoring using animal-based indicators (Anonymous, 2006a). In 2016, the Association for Technology and Structures in Agriculture (German designation: Kuratorium für Technik und Bauwesen in der Landwirtschaft e.V., KTBL) published a guideline for farm self-monitoring of pigs in Germany, which is based on the WQP. However, reliability studies on the final guideline in the rearing period are lacking. Neither welfare assessment protocol contains any indicators specifically selected for rearing piglets. It is recommended to use the existing indicators for growing pigs in the rearing period, although reliability has not been verified for this age (Welfare Quality, 2009; KTBL, 2016). Therefore, the overall aim of the present study was to clarify whether the recommended indicators for growing pigs are suitable for use in rearing piglets, as advised in the guideline and protocol. In particular, the TRR of the indicators in rearing piglets was evaluated in an on-farm study.
Reliability describes the similarity in the results of repeated measurements on the same object. On-farm reliability is mainly defined by the interobserver and the test-retest reliability (TRR; de Passillé and Rushen, 2005). The interobserver reliability describes the extent to which different observers obtain equal results when they independently assess the same animals at the same time under the same circumstances. The TRR is the agreement between assessments performed by the same observer on the same objects at different points in time. It describes the consistency of a method over time, which is an important characteristic of assessment protocols (Martin and Bateson, 1994; Temple et al., 2013). An indicator with good consistency over time does not react to minor changes on-farm (e.g., different batches), but major changes should be detected by the measurement tool (Plesch et al., 2010). Therefore, measurement tools with a good TRR should achieve the same results even in the presence of minor changes in the on-farm situation (Windschnurer et al., 2009).
To evaluate the TRR, i.e., the consistency of the indicators over time, two different questions were considered. The first question aimed to clarify whether the TRR between the batches of all farms differed from each other. This was intended to indicate whether the overall welfare assessment of a farm changes when different animals are assessed on the same farm. The assumption is that these minor changes (different batches) do not influence the welfare status of the farm. The second approach analyzed the TRR between the age classes of all farms. For this purpose, the visits of one age class of all farms and batches were compared to the visits of the following (older) age class, i.e., the same animals were assessed. This comparison provided information on whether the age of the piglets was relevant to the implementation of the monitoring. Therefore, selected WQP-indicators and selected indicators originating from other sources (other-indicators) were used with the overall goal of finding feasible, viable and reliable indicators for the animal welfare assessment of rearing piglets.

Ethical statement
No animal experiments were carried out. By nature of welfare assessment protocols, the indicators used should not pose harm, suffering, injury, or stress to the animals. All procedures (e.g., marking of animals, routine health checks) were carried out primarily for the farmers' management purposes and not for scientific ones. The animals in the study were normally farmed animals and were housed conventionally or according to the EU organic scheme (Anonymous, 2007). In both cases, the animals were kept according to EU as well as national law "German Animal Welfare Act" (German designation: TierSchG; Anonymous, 2006a) and the "German Order for the Protection of Production Animals used for Farming Purposes and other Animals kept for the Production of Animal Products" (German designation: TierSchNutztV, Anonymous, 2006b).

Data collection
Data collection was carried out on three farms with a closed system in Schleswig-Holstein, northern Germany, between October 2020 and February 2022. The farms participated on a voluntary basis and differed in production factors to increase variance. This makes it possible to obtain a better indication of the TRR and of a possible influence on the indicators by the age of the animals and/or by changing animal groups (batches). The differences in production factors are shown in Table 1. The three farms were visited by the same observer, who had been trained in the assessment of animal welfare indicators on farm by a member of the Welfare Quality Network before the study started. The study started with the random selection of a group of sows to be farrowed, ranging from 4 to 10 sows depending on the different herd sizes of the farms (Farm A: 50 sows; Farm B: 1400 sows; Farm C: 60 sows). The first visit to the animals in this study, i.e., the piglets of the randomly selected sows, took place in the week after farrowing. For management reasons, the piglets were individually marked on these farms, which made the animals recognizable at every visit. The following visits were usually carried out weekly (with a few exceptions due to disease, holidays, or weather conditions). For the weekly assessment, the piglets were restrained as a part of routine health checks. The piglets were crossbred (dam genetics: Large White, Danish/German/Norwegian Landrace, Danish Yorkshire; sire genetics: Pietrain).
Table footnote: Two values for this cell, since the piglets were rehoused within the rearing period.
Witt et al.
The exact number of piglets that became part of this study depended on the number of piglets born alive from the group of sows and ranged between 40 and 120 piglets per batch (batch = time period from birth to slaughter). Three batches per farm were visited weekly over a period of 1 yr so that in total 759 piglets were assessed. Only the assessments collected in the rearing barns were included in this study. The detailed batch sizes during the rearing period are shown in Table 2.
As mentioned, the randomly selected sow groups of the assessed piglets varied in size due to farm management. This led to different numbers of piglets per batch, especially on Farm C. The time piglets were kept in the rearing barns varied between the farms. This was due to different husbandry practices (organic: suckling period of 50 d) and fattening barn capacities on the farms. In general, the piglets of Farm A remained with the lactating sow in the farrowing barn for 3 wk. Thereafter, they were moved into a group suckling barn with the lactating sow as well as two other sows and their litters (up to 36 piglets, depending on the litter size) in a community pen for approximately another 30 d. On day 50 of life, the piglets of the organic Farm A were weaned and transferred to the piglet rearing barn in the already formed groups. This shortened the rearing period of Farm A by about 3 wk compared to the conventional farms. The piglets from Farms B and C were weaned after the suckling period of 4 wk and transferred to the rearing barn. The piglets were transferred to another pen on Farms B and C within the rearing period, which resulted in two different values for "space per piglet" (compare Table 2). The rearing period ended for all farms between days 63 and 90 of life, which corresponded to a live weight of approximately 30 kg. This wide range of the end of the rearing period was due to the capacities available in the fattening house. This resulted in a different number of days in the rearing barn and thereby in a different number of visits per batch, as shown in Table 2. The piglets on Farms A and C were fed twice a day by hand in a long trough with dry feed according to the growth curves. The piglets on Farm B were fed several times a day by liquid feeding.
The piglets of Farms A and C had undocked tails. Those of Farm B had docked tails. There were no major changes regarding the farm's management (e.g., weaning time) throughout the data collection.

Protocol assessments
The weekly assessments of the individually marked piglets were based on selected health and welfare indicators from different protocols that had been recommended for use in regular self-assessment to fulfill the national legal requirement. On the one hand, the included indicators were derived from the WQP for pigs (hereafter called "WQP-indicators"). On the other hand, they were derived from other sources, such as the German guideline for farm self-monitoring, the German literature database (Welfare Quality, 2009; KTBL, 2016; NaTiMon, 2021) and standard health checks from veterinary routine practice (Baumgartner, 2009) (hereafter called "other-indicators"). The resulting protocol for this study contained, on the one hand, pen-level indicators (PIN), all originating from the WQP, of which the "behavioral observations" (BO) and the "human-animal-relationship test" (HAR) were considered separately. On the other hand, it contained animal-based individual-level indicators (IIN), partly from the WQP and partly from other sources, which are described below. Hence, this is a comprehensive study of the reliability of welfare indicators described in the literature to date for potential use in rearing piglets. The complete list of indicators, their source, their welfare principle, their definitions, and scoring scale are presented in Table 3. The indicators were scored using either a three-point scale (category 0 = absent, category 1 = light appearance, category 2 = strong appearance) or a two-point scale (category 0 = absent, category 1 = present). The indicators were used following the guidelines and explanations of the respective protocols/descriptions they originated from. The classification of the WQP-indicators according to the four welfare principles ("good feeding", "good housing", "good health", "appropriate behavior") provides an overview of the part of animal welfare for which the indicator is relevant.
Detailed explanations as well as descriptions can be found in the WQP for pigs, the German guideline for farm self-monitoring, the German literature database (Welfare Quality, 2009;KTBL, 2016;NaTiMon, 2021) and standard health checks from veterinary routine practice (Baumgartner, 2009).

Pen-level indicators.
The weekly visits were scheduled in the morning, after the staff had already carried out their animal check. The assessments of the PIN were carried out at the beginning of each visit. The first step was to watch the pens and count how many piglets were lying, "huddling", "panting", or "shivering". After that, the pens were entered by the observer and the "HAR" was carried out. At the same time, the pen was checked for any liquid manure ("scouring"). After making the animals stand up, the number of "coughs" and the number of "sneezes" occurring per pen were recorded for 5 min. For further calculation, the number of "coughs"/"sneezes" in the 5 min was divided by the number of animals under observation.
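The normalization of the "coughs" and "sneezes" counts can be illustrated with a minimal sketch. The function name and the example numbers are hypothetical and are not taken from the study.

```python
def events_per_animal(event_count: int, animals_observed: int) -> float:
    """Number of coughs or sneezes counted in the 5-min window per observed animal."""
    if animals_observed <= 0:
        raise ValueError("at least one animal must be under observation")
    if event_count < 0:
        raise ValueError("event_count cannot be negative")
    return event_count / animals_observed


# Hypothetical pen: 3 sneezes counted while 24 piglets were under observation.
sneezes_per_piglet = events_per_animal(3, 24)
```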

Human-animal-relationship test.
After the assessment of the PIN, the "HAR" (pen-level) was carried out to evaluate the reaction of the piglets towards the observer. For this purpose, the corresponding pens were entered. The observer first walked clockwise along the pen wall (inside the pen) and then stood still in the middle of the pen for 30 s. Then, the observer walked counterclockwise along the pen wall and assessed whether the piglets showed a panic response, e.g., fleeing or huddling in the corner. There was no physical interaction with or talking to the animals while the observer performed the test. For further analysis of the "HAR", the percentage of pens with a panic response out of the total observed pens per farm was taken into account.

Behavioral observations.
The "BO" (pen-level) followed after counting the number of "coughs" and "sneezes". The piglets had 5 min to calm down after they had been encouraged to stand up. Each pen was then scanned by the observer every 2 min and the piglets were sorted into the categories "positive social behavior", "negative social behavior", "pen investigation", "use of enrichment material", "other active behavior", or "resting". A total of five scans were performed per pen. For further analysis, the "BO" was expressed as a percentage of the total active behavior, which was all behavior except resting.
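The conversion of scan counts into percentages of total active behavior can be sketched as follows. This is only an illustration: pooling the counts of all scans of a pen before calculating the percentages is an assumption (the text does not state whether scans were pooled or averaged), and the function name and the example counts are hypothetical.

```python
from collections import Counter

# Active categories of the behavioral observations; "resting" is excluded.
ACTIVE_CATEGORIES = [
    "positive social behavior",
    "negative social behavior",
    "pen investigation",
    "use of enrichment material",
    "other active behavior",
]


def bo_percentages(scans):
    """Percentage of each active category relative to all active behavior.

    scans: list of dicts, one per scan, mapping category -> number of piglets.
    Assumption: counts are pooled over all scans of the pen.
    """
    totals = Counter()
    for scan in scans:
        totals.update(scan)
    active_total = sum(totals[c] for c in ACTIVE_CATEGORIES)
    if active_total == 0:
        # No active piglets in any scan: percentages are reported as 0 here.
        return {c: 0.0 for c in ACTIVE_CATEGORIES}
    return {c: 100.0 * totals[c] / active_total for c in ACTIVE_CATEGORIES}


# Hypothetical pen with two scans (the protocol uses five scans per pen).
example = bo_percentages([
    {"positive social behavior": 2, "pen investigation": 6, "resting": 10},
    {"positive social behavior": 1, "negative social behavior": 1, "resting": 8},
])
```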

Individual-level indicators.
All individually marked piglets were scored from both sides for a variety of IIN, e.g., "bursitis", "lameness", and "wounds on the body" (Table 3).

Statistical analysis
Data were processed first in Microsoft Excel (Microsoft Corporation, 2016), then in the statistical software SAS 9.4 (SAS Institute Inc., 2017). Eligibility checks were carried out before further analysis to rule out any potential transmission errors. In order to evaluate TRR, i.e., the consistency over time of the welfare indicators, the comparability between the batches and the comparability of the age classes were determined. The TRR of the three batches assessed on each farm was compared to investigate whether the TRR had been influenced by the group of animals assessed. To also evaluate a possible influence of the age of the animals on the TRR, the weekly assessment visits were classified into nine age classes, and one age class was always compared to the following age class (the visit in the following week). For both comparisons, the different categories (0, 1, and, if applicable, 2) of each indicator were treated as independent variables and analyzed as percentages of animals assigned to the corresponding category per visit (fictitious example for clarification: visit week 1: "lameness" category 0: 98% of the assessed animals, "lameness" category 1: 1.5% of the assessed animals, "lameness" category 2: 0.5% of the assessed animals). The percentages achieved in the two compared visits were then compared with regard to TRR. For the evaluation of the TRR of the two comparisons (batch, age class), a combination of different reliability and agreement parameters was calculated using the statistical software SAS 9.4 (SAS Institute Inc., 2017) as described below. This combination of parameters is commonly advised and has been used in comparable reliability studies (de Vet et al., 2006; Czycholl et al., 2016; Can et al., 2017; Friedrich et al., 2019a, 2019b). Interpretation of values was carried out with regard to existing literature to ensure comparability (Temple et al., 2013; Czycholl et al., 2016; Friedrich et al., 2019a, 2019b) and to general recommendations concerning these parameters (Martin and Bateson, 1994; McGraw and Wong, 1996; de Vet et al., 2006).
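The per-visit percentages can be illustrated with a short sketch. The analysis itself was carried out in SAS; the Python function below is only a hypothetical re-expression of the step, and the 200 scores are constructed to reproduce the fictitious "lameness" example given above.

```python
from collections import Counter


def category_percentages(scores, categories=(0, 1, 2)):
    """Percentage of assessed animals in each score category for one visit.

    scores: one score per assessed animal for a single indicator.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("no animals were assessed at this visit")
    counts = Counter(scores)
    return {c: 100.0 * counts[c] / n for c in categories}


# Fictitious visit: 200 piglets, 196 scored 0, 3 scored 1, 1 scored 2.
lameness_week_1 = [0] * 196 + [1] * 3 + [2] * 1
percentages = category_percentages(lameness_week_1)
```

Each category then forms its own series of percentages across visits, and these series are what the reliability and agreement parameters are calculated on.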

Spearman's rank correlation co-efficient (RS).
The RS is a nonparametric measure of rank correlation (Gauthier, 2001). The RS ranges between −1 and 1, whereby values closer to 1 indicate a higher correlation and a greater confidence in the TRR. RS ≥ 0.40 was interpreted as acceptable reliability and RS ≥ 0.70 as good reliability.
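For illustration, the RS can be calculated as the Pearson correlation of the ranks of the two series, with tied values receiving their average rank. The sketch below is a plain-Python illustration and not the SAS procedure used in the study; the function names and the example values are hypothetical. It returns None when one series has no variance, in which case the co-efficient is not defined.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(first, second):
    """Spearman's rank correlation between two paired series of percentages."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired series with at least two values")
    rx, ry = average_ranks(first), average_ranks(second)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None  # no variance in one series: RS cannot be calculated
    return sxy / math.sqrt(sxx * syy)


# Hypothetical percentages of one indicator category at two compared visits.
rs = spearman_rs([1.5, 4.0, 2.2, 6.1], [2.0, 3.8, 2.5, 7.0])
```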

Intraclass correlation co-efficient (ICC).
The ICC places the variance between study objects in proportion to the variance between study objects plus measurement error (de Vet et al., 2006). In accordance with Shrout and Fleiss (1979), a two-way model was used. The co-efficient can reach values between 0 and 1. For interpretation an ICC ≥ 0.40 was defined as acceptable and an ICC ≥ 0.70 as good reliability.
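The two-way ICC can be sketched from the mean squares of a two-way analysis of variance. The text does not state which two-way form of Shrout and Fleiss (1979) was used, so the single-measure form ICC(2,1) in the sketch below is an assumption, as are the function name and the data layout (one row per study object, one column per visit).

```python
def icc_2_1(table):
    """ICC(2,1) after Shrout and Fleiss (1979): two-way random effects,
    absolute agreement, single measurement.

    table: list of rows, one row per study object, one column per visit.
    Returns None if the co-efficient is not defined (no variance at all).
    """
    n, k = len(table), len(table[0])
    if n < 2 or k < 2 or any(len(row) != k for row in table):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        return None
    return (ms_rows - ms_error) / denominator


# Hypothetical percentages: three study objects, two visits in perfect agreement.
icc_perfect = icc_2_1([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

Note that this formula can yield negative estimates when the error variance exceeds the variance between study objects; how such values are reported is a separate choice.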

Limits of agreement (LoA).
The LoA estimates the differences between two sets of measurement values and the standard deviations of these differences (Bland and Altman, 1986). The differences between the compared visits in the present study were expressed as a percentage, i.e., they range between −100 and 100. Values close to −100 reflect a higher classification of the assessed animals in the second visit, whereas values towards 100 indicate that the animals were more often categorized in higher categories of the indicators in the first visit. An interval within −10.0 to 10.0 implied acceptable agreement and an interval within −5.00 to 5.00 implied good agreement.
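A minimal sketch of the calculation, assuming the conventional form of the limits (mean difference ± 1.96 standard deviations of the differences) and differences taken as first visit minus second visit, which matches the sign convention described above. The multiplier 1.96, the function name, and the example values are assumptions for illustration.

```python
import statistics


def limits_of_agreement(first_visit, second_visit, multiplier=1.96):
    """Lower and upper limits of agreement for paired percentages.

    Differences are first minus second, so positive values mean that the
    percentage was higher at the first visit.
    """
    if len(first_visit) != len(second_visit) or len(first_visit) < 2:
        raise ValueError("need two paired series with at least two values")
    differences = [a - b for a, b in zip(first_visit, second_visit)]
    mean_difference = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)  # sample standard deviation
    return (mean_difference - multiplier * sd_difference,
            mean_difference + multiplier * sd_difference)


# Hypothetical percentages of one indicator category at two compared visits.
lower, upper = limits_of_agreement([10.0, 12.0, 8.0, 11.0], [9.0, 13.0, 7.0, 12.0])
```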

Results
To enhance readability, only those categories of the respective indicators representing presence of a welfare issue, i.e., an observation, are shown. Thus, category 0 (0 = absent) is generally not shown.

Pen-level indicators (PIN)
The mean values in percent as well as the standard deviation of the three batches and the nine age classes for the PIN are presented in Table 4. "Coughing" and the three-point scale indicators "panting" and "shivering" as well as "huddling" and "scouring" (both only category 2) had a prevalence of less than 1% across all batches and age classes. Therefore, these indicators were not evaluated further. Of the remaining PIN, "HAR" had the highest prevalence with up to 86.8% (batch 3), whereby the prevalence (panic reaction of the piglets) tended to decrease with advancing age (increasing age class). This decreasing trend in prevalence with increasing age could also be observed in "huddling" (category 1). In contrast, "sneezing" and "scouring" (category 1) showed no clearly discernible trend and had a prevalence of 0% to 1.7% and 0% to 25%, respectively, across all batches and age classes. Table 5 includes the statistical parameters for the indicators mentioned in Table 4 without the indicators of the "BO", which are shown in Table 6. "Scouring" (category 1) achieved acceptable to good results for the TRR, especially for the ICC. In contrast, "sneezing" showed acceptable to good results for the RS but not for the ICC. "HAR", on the other hand, achieved acceptable to good results for both reliability parameters. All the indicators mentioned rarely achieved acceptable or good values for the LoA, except "sneezing".
The missing values for some statistical parameters in Tables 5 and 6 resulted from increased ties or a lack of variance. Thus, it was not sensible to calculate the statistical parameters due to the small number of piglets (relocation to the fattening stable), especially in the later age classes.
The results of the "BO" in Table 4 are expressed as a percentage of total active behavior, excluding resting (as advised by the WQP). The values for "positive social behavior" ranged between 5.5% and 12.7%, for "negative social behavior" between 1.7% and 5.9%, for "pen investigation" between 8.3% and 22.9% and for "use of enrichment material" between 3.6% and 25.0% across all batches and age classes.
The TRR calculation for the categories of the "BO" showed overall acceptable to good results for the statistical parameters RS and ICC for both comparison approaches (batches, age classes) except for "negative social behavior" (Table 6). All other categories of "BO" achieved good values (>0.7) for the three batch comparisons for the ICC. Further, the results for the TRR of the age class comparisons were in general acceptable to good for the categories of the "BO" mentioned. In contrast, good values were rarely achieved for the LoA.

Individual-level indicators (IIN)
The mean values in percent as well as the standard deviation of the three batches and nine age classes for the IIN are visualized in Table 7. In principle, most IIN had a low prevalence. Therefore, they had low mean values across all batches and age classes. The two-point indicators with a prevalence close to zero were "abdominal wall", "body condition", "eye alterations", "neurological disorders", "pumping", "rectal prolapse", "reddened eyes" as well as "twisted snout". For the three-point indicators, these were "bursitis" (both categories), "hernias" as well as "skin condition" (both category 2), which is why these indicators were not analyzed further and thus not presented. All remaining indicators had mean values below 5% except "ear lesions", "skin condition", "tail length", "tail lesions", "tail posture", and "wounds on the body" (all category 1), whereby "tail length" (category 1) had the highest mean value over the three batches with 52.6%, followed by "ear lesions" with 18.5%.
The TRR calculation for the statistical parameters are presented in Table 8 for the WQP-indicators IIN and in Table 9 for the other-indicators (compare Table 3). In general, batch comparisons are of lower TRR than age class comparisons for IIN. For example, "hernias" (category 1), the values for

Test-retest reliability assessment
This study tested different animal welfare indicators for their TRR, especially with regard to a possible influence of the age of the assessed piglets as well as different animal groups (batches). The results varied widely depending on the indicator. Nevertheless, to cover all different aspects of welfare (and therefore the four welfare principles), a number of different indicators need to be included in welfare assessment protocols. There are different reasons that can lead to a low TRR (Temple et al., 2013). However, distinguishing between the different reasons was not the aim of the present study.
As a general limitation of this study, it should be borne in mind that the farms participated on a voluntary basis and that there were no known health problems on these farms. In addition, all farms are located in the same region with similar climatic conditions. Hence, the assumptions made are based on this, and generalization to, e.g., countries with a differing climate may be problematic. Moreover, the included welfare-related indicators, which are based on the existing literature and existing recommendations, may not comprise all welfare indicators potentially available, especially as research on applicable welfare indicators is an ongoing process.
Table 5. Spearman's rank correlation co-efficient, RS; intraclass correlation co-efficient, ICC; limits of agreement, LoA for the pen-level indicators for the comparisons (C) of the three batches (B) and nine age classes (A) indicating poor (normal type), acceptable (italic type) and good agreement (bold type). Footnote: Indicator with a three-point scale, but category 2 had a prevalence close to 0 and is therefore not presented.

Journal of Animal Science, 2023, Vol. 101
For statistical analysis, a combination of reliability (RS, ICC) and agreement (LoA) parameters was used. De Vet et al. (2006) advised the use of several parameters for the evaluation of reliability to compensate for their individual disadvantages, as relying on only one parameter can result in misinterpretation. Moreover, this combination of parameters has been used in other studies concerning the reliability of welfare assessment protocols and thus further ensures comparability to the literature (e.g., Temple et al., 2013; Czycholl et al., 2016; Can et al., 2017; Friedrich et al., 2019a, 2019b). According to Wirtz and Caspar (2002), when interpreting the reliability parameters (RS, ICC), it is important to note that they depend on the total variance of the study objects, i.e., if variance among study objects is low, reliability may be underestimated. To indicate acceptable TRR, the statistical parameters should reach the values of acceptability (RS, ICC: ≥ 0.40; LoA: ≤ −10.0 to 10.0). But if variance strongly affected the explanatory power of the reliability parameters, reliability can still be interpreted as sufficient in some cases even if the reliability parameters are not acceptable (but agreement is good). If the opposite occurs, namely acceptable reliability values without acceptable agreement values, an existing relationship can still be assumed and conclusions can be drawn between the farm visits. In this case, no exact agreement is present, which may limit the application of those indicators for some purposes, but the order and thus the ranks of the animals remain the same, leading to good values for the reliability parameters (Czycholl et al., 2018). Hence, for purposes in which specifically the ranking of, e.g., farms would be important, these indicators may still be useful.
Table 6. Spearman's rank correlation co-efficient, RS; intraclass correlation co-efficient, ICC; limits of agreement, LoA for the single categories of the "behavioral observations" (BO) for the comparisons (C) of the three batches (B) and nine age classes (A) indicating poor (normal type), acceptable (italic type), and good agreement (bold type)
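The interpretation thresholds stated in the text (RS and ICC: ≥ 0.40 acceptable, ≥ 0.70 good; LoA: interval within −10.0 to 10.0 acceptable, within −5.00 to 5.00 good) can be written down as a small sketch. The function names and the label "not calculable" are hypothetical; the example values are the range reported in the abstract for "positive social behavior".

```python
def classify_coefficient(value, acceptable=0.40, good=0.70):
    """Classify a reliability parameter (RS or ICC) by the stated thresholds."""
    if value is None:
        return "not calculable"  # e.g., too many ties or no variance
    if value >= good:
        return "good"
    if value >= acceptable:
        return "acceptable"
    return "poor"


def classify_loa(lower, upper, acceptable=10.0, good=5.0):
    """Classify a limits-of-agreement interval by the stated thresholds."""
    if lower >= -good and upper <= good:
        return "good"
    if lower >= -acceptable and upper <= acceptable:
        return "acceptable"
    return "poor"


# Range reported for "positive social behavior" in the abstract.
rs_labels = [classify_coefficient(0.34), classify_coefficient(0.89)]
loa_labels = [classify_loa(-2.93, 7.41), classify_loa(-18.9, 11.5)]
```

Keeping the two classifications separate mirrors the distinction made above: an indicator can rank visits consistently (acceptable RS or ICC) without exact agreement (poor LoA), and the reverse.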

Pen-level indicators
Considering the prevalence of the selected indicators in this study, it is not possible to give a conclusive assessment for the following PIN: "coughing", "panting", "shivering" as well as "huddling" and "scouring" (both category 2). It might be that these indicators are not relevant in this phase of life (rearing period) on well-managed farms in regions without extreme climatic conditions. The lack of variance between the farm visits meant that it was not meaningful to calculate RS for "huddling" (category 1), especially in the later age groups. However, the other statistical parameters indicate low TRR.
Further, the prevalence is low for "sneezing" with mean values ranging from 0.2% to 1.7% but in a similar range to the study by Friedrich et al. (2019b) with suckling piglets and slightly higher values compared to the study by van Staaveren et al. (2018) with rearing piglets. The acceptable to good TRR detected in this study is likewise similar to the results of Friedrich et al. (2019b). This is in contrast to Temple et al. (2013), who detected poor TRR for "sneezing" and at the same time much higher mean values with 19.7% in growing pigs. The large differences can possibly be explained by different climatic conditions and a higher age of the animals. In addition, Temple et al. (2013) visited 15 different farms, which significantly increased the variance of the indicator "sneezing", as determined by the high standard deviation.
The high prevalence and high variance between farms for "scouring" (category 1) compared to other studies should be taken into account when interpreting the present results (Temple et al., 2013;van Staaveren et al., 2018;Friedrich et al., 2019b). This is due to the fact that the outdoor arena of the organic farm often implied a case of "scouring", according to the definition of the indicator, if it had previously rained a lot. Therefore, the question of the validity of the indicator arises, especially in pens with an outdoor arena. However, despite the suggestion that precipitation influences the results, good TRR was achieved in this study.
In summary, only "sneezing" and "scouring" (category 1) of the PIN achieved good results for the statistical parameters, but the latter indicator was influenced by the weather conditions in the outdoor arena.

Human-animal-relationship test (HAR)
The prevalence of the flight reaction decreased with increasing weight and age of the piglets. This decreasing tendency explains the acceptable to good results for the reliability parameters (RS, ICC), while no exact agreement was achieved, as shown by the LoA. The effect of age was also recognized in the studies by Czycholl et al. (2016) and Mieloch et al. (2020) with growing pigs. It should be noted that, unlike in the studies mentioned, the piglets were visited weekly in this study. It can be assumed that the animals became used to the observer's weekly visits and no longer showed any panic reactions. This assumption is supported by the conclusion of de Passillé and Rushen (2005) that the "HAR" may be too sensitive and susceptible to minor changes. It is therefore not recommended to carry out the test on very young piglets, which are very anxious due to the change of pen and the new environment. The batch comparisons, which correspond more closely to the period of a self-assessment, achieved in general acceptable results for RS and ICC. This is similar to the results of Temple et al. (2013). However, overall, TRR is lower than in the age class comparisons, thereby revealing problems in the assessment of different groups of animals (batches).

Behavioral observations
The recorded prevalences were in similar ranges to the prevalences from studies with growing pigs (Temple et al., 2011; Czycholl et al., 2016). Only an increased tendency for the category "use of enrichment material" could be determined, which can be explained by the straw and the feeding racks in the pens, classified as enrichment material. Moreover, it is known that the playing behavior, which includes the use of enrichment material, increases in the first 6 wk of life and then declines to lower levels by week 14 of life (Brown et al., 2015). The statistical parameters achieved overall acceptable to good results for the categories "positive social behavior", "pen investigation", and "use of enrichment material" for all comparisons of the batches, comparable to the results of Czycholl et al. (2016) and Friedrich et al. (2019a). In contrast, Temple et al. (2011) detected low TRR in growing pigs for all categories of the BO. According to Temple et al. (2011), the cause of the low TRR in their study was the observer's increasing experience in the assessment during data collection, i.e., this result was caused by the experimental setup and is therefore not directly comparable to the results of Czycholl et al. (2016), Friedrich et al. (2019a), and the present study. In the present study, again, the age class comparisons achieved better results for the statistical parameters than the batch comparisons. The comparison of the later age classes sometimes showed opposite results, because the total number of animals decreased (relocation to the fattening stable) and therefore there was more space in the pens. This resulted, for example, in less social behavior because the piglets were able to keep their individual space (i.e., real differences). An exception to the generally acceptable to good TRR of the "BO" in this study was the category "negative social behavior". The low TRR for this category could be explained by the lower variance in the data and by real differences between the visits. For example, the observer sometimes recorded increased "negative social behavior" during a tail biting outbreak.

"Abdominal wall", "body condition", "bursitis" (both categories), "eye alterations", "hernias" (category 2), "neurological disorders", "pumping", "rectal prolapse", "reddened eyes", "skin condition" (category 2), "twisted snout" had a prevalence close to zero and are not shown.

Table 8. Spearman's rank correlation co-efficient, RS; intraclass correlation co-efficient, ICC; limits of agreement, LoA for the comparisons (C) of the three batches (B) and nine age classes (A) for the individual-level indicators originated from the WQP for pigs (Welfare Quality, 2009)

Individual-level indicators selected from the WQP
Considering the prevalence of the selected indicators sourced from the WQP (WQP-indicators), it is not possible to give a conclusive assessment for the following IIN: "abdominal wall", "body condition", "bursitis", "pumping", "rectal prolapse", "twisted snout" as well as "hernias", "lameness", and "skin condition" (all category 2). The reasons (e.g., climatic conditions, health status of the farms, age of the pigs) that led to the low prevalence of the aforementioned IIN were probably the same as those already discussed for PIN. Van Staaveren et al. (2018) recorded similar prevalence for "hernias", "rectal prolapse" as well as "twisted snouts" but slightly higher values for "body condition" (4.4%) at the beginning of the rearing period and higher values for "bursitis" (3.9%) at the end of the rearing period. This difference can probably be explained by the fact that, in contrast to the study of van Staaveren et al. (2018), in the present study small and thin piglets stayed longer in the farrowing unit (i.e., were not found in the rearing period, which was solely assessed). Czycholl et al. (2016) noted a prevalence of 30.0% for "bursitis" (category 1) in growing pigs and found a poor TRR for "bursitis". The authors attributed this to the difficulty of assessing the indicator under insufficient light in the pens and with moving animals. The difference in prevalence between rearing piglets and growing pigs suggests that "bursitis" is not a major problem at a young age, but develops at a later age. The remaining WQP-indicators with higher prevalence (in category 1) are "wounds on the body", "lameness", "skin condition", "hernias", and "tail lesions". In general, the prevalence in category 2 was very low (for indicators with a three-point scale). An exceptional value of 10% was recorded for "wounds on the body" in age class 5. This was probably caused by the fact that eight pen partitions of one compartment were not closed correctly during batch 2 on Farm C and the 40 animals had mixed overnight, which led to rank fights. This circumstance fulfills the condition "real differences" (Temple et al., 2013). In general, the impact of rank fighting is visible for "wounds on the body" as the prevalence decreases in older piglets, i.e., across the age classes. In newly arranged groups, fights occur to establish a rank order (Meese and Ewbank, 1973). These rank fights could be the reason for a low TRR for the age class comparisons, because there were real differences between consecutive visits. The comparisons of the batches partially achieved acceptable results for the statistical parameters. Similar TRR studies have achieved acceptable to good results for "wounds on the body" (Temple et al., 2013; Czycholl et al., 2016; Friedrich et al., 2019b).

Table 9. Spearman's rank correlation co-efficient, RS; intraclass correlation co-efficient, ICC; limits of agreement, LoA for the comparisons (C) of the three batches (B) and nine age classes (A) for the individual-level indicators originated from the German guideline for farm self-monitoring, German literature database (KTBL, 2016; NaTiMon, 2021) and standard health checks from veterinary routine practice (Baumgartner, 2009)
The present study revealed acceptable to good results for "lameness", except for the statistical parameter RS within the batch comparisons, which indicates ties in the data and again problems with the evaluation of different animal groups in different batches. Likewise, Czycholl et al. (2016) obtained acceptable to good results for the TRR, but they did not use the RS and ICC as parameters. In contrast, Temple et al. (2013) and Friedrich et al. (2019b) found poor TRR for "lameness". Friedrich et al. (2019b) concluded that it was difficult to identify lame animals when the piglets moved as a group. They recommended an individual assessment, which was carried out in the present study and apparently improved the reliability results.
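Why ties can depress RS even when two assessments agree closely can be shown with a small sketch. The prevalences below are hypothetical and chosen only to make the effect visible. The LoA are again assumed to follow the Bland-Altman form (mean difference ± 1.96 standard deviations).

```python
from statistics import mean, stdev

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    """Spearman's rank correlation computed as Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical lameness prevalences (%) in six pens for two batches.
# Almost every pen scores 0%, so the data consist largely of ties.
batch_1 = [0.0, 0.0, 0.0, 0.0, 0.0, 2.5]
batch_2 = [0.0, 0.0, 0.0, 0.0, 2.5, 0.0]

rs = spearman(batch_1, batch_2)

# Agreement on the original scale (Bland-Altman limits of agreement).
diffs = [a - b for a, b in zip(batch_1, batch_2)]
mean_diff = mean(diffs)
lower = mean_diff - 1.96 * stdev(diffs)
upper = mean_diff + 1.96 * stdev(diffs)
print(f"RS = {rs:.2f}, LoA = [{lower:.2f}; {upper:.2f}]")
```

Here a single affected pen switching position is enough to turn RS negative (−0.2), although no pen differs by more than 2.5 percentage points between batches and the LoA stay narrow. With such tie-heavy data, RS should therefore be read together with the other parameters.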
Acceptable results were achieved for "skin condition", which is in accordance with the results of Friedrich et al. (2019a; in sows) and Czycholl et al. (2016; in growing pigs). In contrast, Temple et al. (2013) detected poor TRR results for growing pigs and argued that "skin condition" has unpredictable variations over time and affects farms sporadically. The conditions in this study differed in that the piglets were scored weekly and skin changes were sometimes visible for longer. Moreover, sunburn occurred regularly on the organic farm at certain times of the year.
Further, the indicator "hernias" achieved acceptable TRR overall in the present study. The lower TRR than in the study by Czycholl et al. (2016) for growing pigs is related to the age of the piglets, which influences the size of the hernias. In addition, small hernias can be overlooked more easily.
"Tail lesions" achieved in general acceptable to good TRR for both comparisons in the present study. This confirms the results of Czycholl et al. (2016) with good TRR for "tail lesions" in growing pigs. In contrast, Temple et al. (2013) determined a low TRR for "tail lesions" and argued that indicators with a high intrafarm variability might not be detected by the sampling method. To generate accurate estimates for relatively rare indicators, such as "tail lesions", a higher percentage of animals per pen would have to be sampled than recommended in the WQP. The present study, in which all piglets were assessed during the farm visit, confirms this reasoning.
In summary, out of 10 indicators originating from the WQP in this study, "wounds on the body", "lameness", "skin condition", "hernias", and "tail lesions" achieved acceptable to good results in TRR in rearing piglets. However, this only applies to category 1, as the prevalence was too low for the calculation of the TRR in category 2 for the three-point scale indicators.

Individual-level indicators selected from other sources
Considering the prevalence of the selected indicators originating from other sources, it is not possible to give a conclusive assessment for the following IIN: "abdominal wall", "eye alterations", "neurological disorders", and "reddened eyes". The remaining indicators from other sources, which are discussed in the following, are "tail length", "tail posture", "normal behavior", "back posture", "claw alterations", and "ear lesions".
"Normal behavior" achieved acceptable results with regard to TRR in both comparisons (age class, batch). It can, thus, be recommended for use. However, "back posture" only achieved acceptable results in the age class comparison, but not in the batch-wise comparisons. This is most likely due to the fact that in the batch-wise comparison, other animals were assessed. These may have had a different health status ("real differences", first reason for a low TRR). Moreover, "back posture" was prone to influences by the observer entering the pen, i.e., TRR may be enhanced by an evaluation from outside the pen. Moreover, clearer categorization might be helpful, as it was not easy even for the same observer to assess which smallest change in the "back posture" had to be assessed as category 1, especially in a big pen with moving animals. Overall, "back posture" is an important indicator of early disease detection and thus should be implemented during daily routine checks as well as during farms' self-assessments. However, the usefulness in longer term welfare assessments needs further studies with clearer classification criteria and an evaluation from outside the pen.
Acceptable to good reliability results were achieved for the three-point scale indicator "tail length" in the present study. The good results for category 1 (docked tails/partial loss) are probably due to the docked tails of the piglets from Farm B. It would be advisable to verify the results on undocked piglets. The different management decisions on tail docking should be taken into account in the interpretation of the results. Nevertheless, the indicator can be recommended for use in on-farm welfare assessment protocols. These results are supported by the good interobserver reliability in the study by Pfeifer et al. (2019).
"Tail posture" achieved acceptable results for the reliability parameters at least for the age class comparisons, whereby again the docked tails on Farm B limit the results of the present study, as assessing of tail posture on docked tails is difficult. Moreover, the observer entered the pen for the assessment thereby changing the tail posture of the animals. This could be the reason for the low prevalence of "tail posture", which ranged between 0.00% and 8.10% for the age classes, compared to a prevalence of 0.30% to 53.0% in the study of Drexl et al. (2022). It is noticeable that almost no exact agreement (no acceptable or good values for the LoA) was achieved. However, still, given the acceptable reliability, "tail posture" can be used for self-monitoring on-farm. This supports the findings of Pfeifer et al. (2019), who suggested the limitation that "tail posture" should be evaluated by the same observer.
"Claw alterations" had an acceptable to good reliability in terms of RS, ICC, and for LoA in the present study. The indicator had already been proven with regard to its TRR in the study by Friedrich et al. (2020) in sows with similar results. However, in the same study, the authors detected a low interobserver reliability, which may be problematic with regard to longer term welfare assessment. As to date, no knowledge about the interobserver reliability of this indicator on rearing piglets exists, thus it can be recommended for use especially with regard to self-monitoring on farm.
The TRR of "ear lesions" can be interpreted as sufficient, which is in accordance with the good results for the interobserver reliability of Pfeifer et al. (2019).
To summarize, out of 10 indicators originating from other sources than the WQP in this study, the following 6 achieved acceptable to good results in TRR in rearing piglets: "tail length", "tail posture", "normal behavior", "back posture", "claw alterations", and "ear lesions".

The four welfare principles
Of all the selected WQP-indicators considered in this study, the best results in terms of TRR in rearing piglets were achieved by "sneezing" and "BO" from the PIN and by "lameness" and "tail lesions" from the IIN. Thus, it would not be possible to cover the four welfare principles in the rearing period with the WQP-indicators alone; problems occur especially for the principles "good feeding", "good housing", and partly "good health". The reasons for a low TRR of these WQP-indicators (time of occurrence of the health problem, age of the animals, etc.) have already been discussed in detail. With the inclusion of the additional indicators derived from sources other than the WQP, a more comprehensive assessment of the different welfare principles becomes possible, especially for the welfare principle "good health" ("ear lesions", "normal behavior", "back posture", etc.). However, no adequate indicators for "good feeding" and "good housing" could be identified, which may also be due to the management, the location, and the climate management of the farms. For the welfare principle "appropriate behavior", the "BO" could be complemented by "tail posture" and replace the "HAR". Given that this study included a comprehensive set of welfare indicators for rearing piglets as described in the literature to date, this is important knowledge for the future welfare assessment of rearing piglets.

Conclusion
The aim of the present study was to assess whether indicators for growing pigs can be used for rearing piglets, as has been recommended so far. To clarify this question, the TRR of the indicators was evaluated and it was determined whether the TRR is influenced by the group of assessed animals (batch) or the age of the assessed piglets. To do this, 28 indicators were selected from different self-monitoring protocols and tested regarding their TRR. Nine out of eighteen indicators (and for four indicators in category 2) originating from the WQP had a prevalence close to zero, which made a calculation of statistical parameters meaningless. The four welfare principles, which together cover the multidimensional field of animal welfare and form the basis of the WQP, cannot be adequately assessed with the remaining WQP-indicators (low prevalence for rearing piglets in this study). In particular, there are problems with the welfare principles "good feeding", "good housing", and partly "good health". To detect injuries or diseases (welfare principle "good health"), which are often at an early stage in rearing piglets, "back posture", "ear lesions", and "normal behavior" as indicators originating from other sources could be included in the WQP because they achieved acceptable to good results for the TRR. These indicators can complement the WQP-indicators with acceptable to good TRR, such as "BO", "HAR", "lameness", "sneezing", "tail lesions", and "wounds on the body". However, no adequate indicators for "good feeding" and "good housing" could be identified.